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Abstract 

We show how machine learning techniques based on Bayesian inference 
can be used to reach new levels of realism in the computer simulation 
of molecular materials, focusing here on water. We train our machine- 
learning algorithm using accurate, correlated quantum chemistry, and 
predict energies and forces in molecular aggregates ranging from clusters 
to solid and liquid phases. The widely used electronic-structure methods 
based on density-functional theory (DFT) give poor accuracy for molec- 
ular materials like water, and we show how our techniques can be used 
to generate systematically improvable corrections to DFT. The resulting 
corrected DFT scheme gives remarkably accurate predictions for the rel- 
ative energies of small water clusters and of different ice structures, and 
greatly improves the description of the structure and dynamics of liquid 
water. 

1 Introduction 

The computer simulation of materials has become an indispensable tool across 
a wide range of disciplines, including chemistry, metallurgy, the earth sciences, 
surface science and biology. Simulation techniques range all the way from simple 
empirical force fields to the electronic structure methods based on density func- 
tional theory (DFT) and correlated quantum chemistry [I] . Electronic structure 
methods are capable of much greater accuracy and generality than force fields, 
but their computational demands are heavier by many orders of magnitude. A 
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crucial challenge for simulation is therefore to find systematically improvable 
methods for casting information from accurate electronic-structure techniques 
into forms that are more rapidly computable. We show here how machine learn- 
ing techniques [2] allow this to be done using correlated quantum chemistry for 
molecular materials, taking condensed-phase water as our example. 

The fundamental interactions in water and other molecular materials |3] con- 
sist of exchange-repulsion, electrostatic interaction between molecular charge 
distributions, polarization (i.e. the electrostatic distortion of charge distribu- 
tions), charge transfer and van der Waals dispersion, together with effects due to 
molecular flexibility. Electron correlation plays a role in all these, and is crucial 
for dispersion [3]. The correlated quantum chemistry methods of MP2 (2nd- 
order M0ller-Plesset) and particularly CCSD(T) (coupled-cluster with single 
and double excitations and perturbative triples) [5] give a very accurate de- 
scription of these interactions [6l [7], but their heavy computational demands 
for extended systems make their routine use for condensed matter problematic. 
DFT techniques are less demanding, and have been widely used for water [5], 
but it has been found that the results with standard approximations often agree 
poorly with experiment [S] and may depend strongly on the assumed approx- 
imation |10j . There is vigorous debate about how to overcome the problems, 
and we believe that input from correlated quantum chemistry is essential. In 
the approach described here, machine learning [5] is used to construct repre- 
sentations of the energy differences between correlated quantum chemistry and 
DFT, which can then be used to construct efficient corrected DFT schemes for 
simulation of large, complex systems. 

Our machine-learning methods are based on the recently reported ideas of 
Gaussian Approximation Potentials (GAP) [TT1[T2]. For molecular materials, 
we use these ideas in the framework of the widely used many-body represen- 
tation [13 m], in which the total energy i?tot(l, 2, . . . iV) of a system of N 
molecules is separated into 1-body, 2-body and beyond- 2-body parts: 

N 

£;tot(l,...A^) =5I^lBW + E^2B(*,j)+i^B2B(l,..-A^) ■ (1) 
i—1 i<j 

Here, £'ib(«) is the 1-body energy of molecule i in free space, which depends 
on its distortion away from its equilibrium configuration. The energy E2B{i,j) 
is the 2-body interaction energy of the pair of molecules (i, j) in free space, 
i.e. the total energy of the pair minus the sum of their 1-body energies. For 
water, E2Bii,j) is a function of 12 variables specifying the separation of the 
molecules, their relative orientation and their internal distortions. The beyond- 
2-body (B2B) energy Eb2B represents everything not accounted for by 1- and 
2-body energies. Exchange-repulsion, Ist-order electrostatics and dispersion are 
mainly or entirely 2-body interactions, while Eb2B arises mainly from polariza- 
tion and perhaps charge transfer, with contributions from exchange-repulsion 
and dispersion expected to be smaller ,15] . The ideas in the present paper build 
on the pioneering works in the groups of Szalewicz [16] and Bowman [17] . 

The energy of an isolated H2O monomer as a function of distortion is known 
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from ab initio calculations to extremely high precision. We use the parame- 
terization due to Partridge and Schwenke (PS) [H], which can be regarded as 
essentially exact for our purposes. In water and many other molecular materi- 
als, the 2-body interactions give the largest contribution to the binding energy, 
so these must be accurately represented [H]. However, in water, the B2B in- 
teractions also play a crucial role [TB]. It has long been known that there is a 
strong redistribution of electrons when a water monomer enters the liquid or 
solid phases, and its dipole moment increases from 1.8 D in the gas phase to 
~ 2.5 D or more in condensed phases [20]. The B2B terms in the energy come 
partly from the large changes of electrostatic interactions due to this redistribu- 
tion. In the strategy presented here, the B2B part of the energy is represented 
by a chosen form of DFT, because DFT can be expected to include most of 
the physical effects that play a role in the B2B energy. We will discuss at the 
end how such an approximation to B2B interactions can be further improved. 
We note here that we fully include molecular flexibility, which is not always 
done even in sophisticated empirical interaction models. Flexibility is essential, 
because without it one could not describe the well known lengthening of 0-H 
bonds and the lowering of 0-H vibrational frequency [2T] when the H atom par- 
ticipates in a hydrogen bond with another monomer, and nor could quantum 
nuclear effects [55] be properly treated. 



2 Machine learning with GAP 

Consider a system whose configuration is specified by points R in a many- 
dimensional configuration space. We are given the values f(Rn) of its energy at 
a finite set of configurations {R„}. We now ask: what is the most likely value 
of /(R) at a configuration R not in the given set {R„}? The rules of Bayesian 
inference [2] are used to compute this most likely value, assuming that the 
function / has certain smoothness properties. Here smoothness simply means 
that the probability of finding very different values /(R) and /(R') decreases 
rapidly to zero as R and R' approach each other. The framework of GAP, based 
on Gaussian processes |23J, uses a precise formulation of smoothness in terms of 
a covariance function C(R, R') having the form (25j : 

C(R,R') = 0exp (-E[(^' - ^^)/(2'^^)]') ' (2) 

where the sum in the exponent is over the dimensions of the configuration space, 
6 is the typical scale of / and a is the typical length scale on which /(R) varies. 
The theory yields the following formula 23 for the most likely estimate of /(R) 
given the data and the assumption of smoothness (often called the maximum a 
posteriori estimator): 

data 

/(R) = ^C(R,R„)a„, (3) 

n 
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where the coefficients are given by inversion of the hnear equations: 



data 



/(R,„) = ^[C(R,„,R„) + e'5 



(4) 



n 



where Smn is the Kronecker delta; the diagonal shift of magnitude e is included 
to regularise the linear algebra. 

When applying GAP to represent 1- and 2-body energetics in water, there are 
different ways of choosing the space of points R representing configurations, but 
here it is advantageous to build in the fact that the energy function /(R) is left 
unchanged by rotations and translations of the whole system, and by interchange 
of identical atoms. For the water monomer, the two OH distances and the angle 
between them provide a convenient coordinate system. For the water dimer, 
we ensure rotation and translation symmetry by working with the space of the 
15 interatomic distances, R — {\ri — rj|}, where are the atomic positions. 
To ensure interchange symmetry, we symmetrize the covariance function over 
permutations of identical atoms: 



where S is the permutation group of the water dimer, whose order \S\ is 8. 
A more detailed account of our GAP formalism is given in [TT] and [H]. The 
computational cost of evaluating the GAP model is linear in the size of the 
database {Rn}, and for the present case takes about 10 ms on a single processor. 

Recently, Gaussian processes were used to model the atomisation energies of 
small molecules [21], and a similar technique (based on neural networks) was used 
to fit the total energy of the water dimer in the DFT approximation [251 . Rather 
than using GAP to represent the whole dimer energy, we use it to represent 
the difference between basis-set converged CCSD(T) and DFT, in other words 
the DFT error. In practice, we do this in two stages. In the first stage, we 
compute the difference between MP2 and DFT at about ten thousand dimer 
configurations, using the AVTZ basis. In using GAP to represent this difference, 
we also use the gradients of the potential energy surface (see SI) . In the second 
stage, we construct a GAP model using around one thousand energies (without 
forces) to represent the small difference between MP2/AVTZ and CCSD(T), 
the latter converged with respect to the basis set to about 1 meV. 

In the calculations to be presented, we use GAP to correct the popular 
Becke-Lee- Yang-Parr (BLYP) approximation of DFT, our choice being guided 
by the fact that BLYP is accurate for the B2B energy of small water clusters 
(see SI) . We show in Fig. 1 the 2-body errors of BLYP together with the errors 
of GAP-corrected BLYP for a thermal sample of dimer configurations (which 
were not used in the construction the GAP model) drawn from a molecular 
dynamics simulation of liquid water. Uncorrected BLYP is too repulsive for the 
water dimer, with unacceptably large errors of up to 50 meV at the separations 
of interest. However, with GAP corrections, the errors are dramatically reduced 
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to ^ 1 meV. GAP thus provides a way of virtually eliminating all errors in a 
chosen DFT approximation apart from those associated with B2B energy. Also 
shown in Fig. 1 are the errors of the approximation obtained by the popular 
procedure of adding the dispersion correction due to Grimme et al. [26] to BLYP. 
We note that this approximation is better than uncorrected BLYP, but is much 
less good than GAP-corrected BLYP. 

3 Results 

To illustrate the power of our GAP-corrected DFT, we start with a simple test 
on the ten stationary points of the water dimer [27]. These form a canonical 
set of configurations, which have been exhaustively studied and whose energies 
are extremely accurately known [6] . The global minimum structure is bound by 
a single hydrogen bond, but some of the less stable structures have up to four 
weaker hydrogen bonds. (For pictures of the structures, see e.g. Ref. [5].) We 
stress that none of these stationary points is included in our training set, so that 
the energies computed with BLYP + GAP are genuine predictions. We compare 
in Table 1 the relative energies of the 10 configurations from BLYP + GAP with 
the almost exact results from Ref. [6j and the predictions of the DFT functional 
BLYP; the Table also includes the very accurate predictions of diffusion Monte 
Carlo (DMG) [15] • As has been reported before [55] j the DFT approximation 
shows quite large errors of around 30 meV in some cases, while the errors of 
DMC are much smaller, being almost all less than 3 meV. Our BLYP + GAP 
predictions are very accurate, and indeed they compete in accuracy with DMC, 
at enormously reduced computational cost. 

As a second test, we examine the predictions of BLYP + GAP for the energies 
of different isomers of the water hexamer. This system has been studied exten- 
sively for many years [HI dOl [7] , for a very important reason. The most stable 
structures of small water clusters from the trimer to the pentamer have a ring- 
like form in which each monomer is hydrogen bonded to two neighbours |30] . 
However, from the hexamer onwards, rings are less stable than compact struc- 
tures in which some monomers are hydrogen bonded to three or four neigh- 
bours [Sni HH n\ . The energy balance for the hexamer is rather delicate, but 
high-precision CCSD(T) calculations leave no doubt that the compact prism and 
cage structures have lower energy than the more open book and ring forms [7]. 
However, many of the commonly used DFT approximations, including BLYP 
and PBE, wrongly predict that the ring or book form is most stable [52]. We 
compare in Figure [2] the predictions of BLYP + GAP with CCSD(T) bench- 
marks and with the predictions of BLYP and DMC. We see that again the 
GAP-corrected DFT model is highly accurate and is comparable to DMC. 

For any material, crystal energetics provides a crucial test of modelling tech- 
niques. Water has a remarkably rich phase diagram, with no fewer than fifteen 
known ice structures [3S1 131]. In the common form ice Ih, found at ambient 
pressure, each H2O monomer is H-bonded to four nearest neighbours at an 0-0 
distance of 2.75 A, the next-nearest neighbours having the much greater 0-0 
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separation of 4.5 A. The pattern of H-bonding in ice Ih is disordered, but the 
closely related ordered form ice XI, stable below 72 K, has essentially the same 
local geometry [33]. With increasing pressure, denser structures become more 
stable, and we will be concerned here with (in order of increasing density) ice 
IX, II, XV and VIII. The distances to non-H-bonded next-nearest neighbours 
decrease along this series, becoming almost exactly equal to the first-neighbour 
distance in ice VIII [S3]. There are accurate experimental values for the zero- 
pressure energies and volumes of almost all these structures. 

Standard DFT approximations perform poorly for ice [35], making the en- 
ergies increase far too much from ice Ih to VIII, and giving transition pressures 
too high by up to a factor of 10. Our machine-learning techniques allow us 
to correct any DFT approximation for 1- and 2-body errors, and enable us to 
discover whether these errors are responsible for the poor description of ice en- 
ergetics. We have used BLYP -I- GAP to calculate the relaxed geometries and 
the equilibrium energies and volumes of the ice structures mentioned above, 
substituting the periodic Bernal and Fowler structure for the proton-disordered 
Ih [36j. The results are reported in Table [5] where we also give results obtained 
with uncorrected BLYP. The energies and volumes relative to ice Ih are very ac- 
curately given by BLYP + GAP, whereas the relative energies with uncorrected 
BLYP suffer the large errors reported earlier for standard DFT methods [55] . 
However, our results also reveal a significant uniform overbinding in all the 
structures due to beyond-2-body errors, implying that BLYP -I- GAP overesti- 
mates the strength of cooperative H-bonding, leading to an overestimate of the 
equilibrium density by about 5-10%. The systematic overestimation of density 
by BLYP + GAP would be partially compensated by zero-point effects, which 
increase volumes by between 1-5% [57] . 

Early attempts to elucidate the properties of liquid water on the molecular 
level using DFT were promising [S], but it has turned out that the standard 
methods give surprisingly poor predictions [9], for reasons that are presumably 
linked to the problems found with ice structures. Other molecular liquids seem 
to suffer from related difficulties |38| . Some of the common approximate func- 
tional give an equilibrium density of water that is too low by as much as 30% 
(PBE and BLYP underestimate it by ~10% and 10-20% respectively) [TUIISS]. 
In DFT simulations of the liquid at the experimental density, the diffusion coef- 
ficient may be too low by as much as a factor of 10, and the liquid is significantly 
overstructured [TU] as compared with experimental neutron and x-ray diffraction 
data. Quite apart from the contradictions with experiment, it is now clear that 
the DFT simulations may be afflicted by many technical sources of error, in- 
cluding system size effects, basis-set incompleteness, incorrect temperature con- 
trol, and the neglect of quantum nuclear effects. Because of the controversies, 
we approach the liquid with caution, particularly since our machine-learning 
methods for going beyond DFT do not yet account for errors in B2B interac- 
tions. Nevertheless, we can address an important and well defined question: 
With machine-learning used to ensure the correctness of IB and 2B parts of 
the energy, can DFT errors in the B2B energies still cause contradictions with 
experiment? 
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We performed molecular dynamics simulations on 64 molecules of liquid 
heavy water (D2O) with BLYP + GAP at temperature 308 K and density 
1.109 g cm^"^. The thermal average pressure in the simulation was —2.6 kbar. 
By contrast, a simulation based on BLYP itself under exactly the same condi- 
tions gives the larger pressure of 7 kbar, which is associated with the 10-20% 
underestimate of equilibrium density. A simple estimate based on our observed 
pressure together with the experimental compressibility indicates that the equi- 
librium density with BLYP + GAP is higher than the correct value by ~ 10%, 
which is consistent with the uniform overbinding observed above for the ice 
structures. 

We compute the self-diffusion coefficient D of molecules in our simulation 
in the conventional way |40j from the slope of the time-dependent mean-square 
displacement (for details, see SI), finding the value 1.3 x 10~® m^ s~^. In 
a system of only 64 molecules, the value of D is expected to be reduced by 
size effects by an amount that can be estimated by standard methods [JT]. 
Our size-corrected value (see SI) of 1.7 x 10~^ m^ s~^ should be compared 
with the experimental value of 2.4 x 10^^ ni^ s^"'^ This contrasts with 

the values reported for BLYP itself, which are up to an order magnitude too 
small [ini HH |44] . Once again, 2-body effects appear to be the main culprit in 
making BLYP unrealistic. 

The well known DFT errors of overstructuring in liquid water are most 
clearly seen in the oxygen-oxygen radial distribution function 5oo('")- A com- 
parison of the experimental and computed RDFs (using BLYP -t- GAP and 
uncorrected BLYP — see Fig. [s]) shows that the GAP correction very signifi- 
cantly improves agreement with experiment. The overstructuring of the liquid 
is largely corrected: the first peak in gooi^) is lowered by 0.25 and the first 
trough becomes shallower by ~ 0.2. However, our comparison with experimen- 
tal data indicates that the liquid is probably still slightly overstructured even 
with BLYP + GAP. We have chosen to compare here with a joint refinement of 
both neutron and x-ray data [45], but one should note the extensive discussion 
in the literature about the uncertainties in both kinds of experimental data. 
We attempt to give a balanced summary of this discussion in the SI. In addi- 
tion, since our simulation treats the nuclei as classical particles, it is essential 
to consider quantum nuclear corrections. Our discussion of the experimental 
and theoretical evidence about these (see SI) indicates that they may lower the 
first peak height oi gooi^) by ~ 0.1. Our assessment that water simulated with 
BLYP -I- GAP is slightly overstructured takes account of both the experimental 
uncertainties and the quantum nuclear effects. 

4 Discussion and conclusions 

The machine-learning methods we have described offer a new approach to the 
modelling of molecular materials. To show the possibilities, we have discussed 
only a single material (water) and we have focused on the systematic improve- 
ment of the 1- and 2-body parts of the energy. Furthermore, the use of machine 
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learning to represent the difference between DFT and accurate quantum chem- 
istry (and thus incurring the cost of DFT when running a simulation) is only 
one of several possible strategies. There is ample scope for removing these limi- 
tations in the future. Note, however, what has already been achieved even with 
these limitations. The description of water energetics for aggregation states 
ranging from clusters to extended solid and liquid states is challenging because 
of the delicate balance between different components of the energy. We have 
shown how to use machine learning to construct a quantitative correction for 1-, 
2-body components of the energy, starting from a chosen DFT approximation 
(in this case, BLYP). We have achieved a substantial improvement in the de- 
scription of cluster and ice-phase energetics and liquid-state properties, with a 
completely negligible increase in computational cost, and we have also revealed 
a remaining inaccuracy in the beyond- 2-body energetics, which we attribute to 
exaggerated cooperativity of H-bonding. 

The most obvious generalisation now needed is to use machine learning to 
systematically improve the beyond-2-body energetics. In fact, GAP-based ma- 
chine learning has already been used with great success to represent many-body 
energies in other types of systems [12] , and there is every reason to expect that 
this can be done for molecular systems. As a first step, we plan to use accurate 
beyond-2-body energies generated for large samples of configurations of small 
and moderate sized clusters, as we have already done in a recent paper [55]. 
Benchmark information from quantum Monte Carlo for configuration samples 
of larger clusters and for the liquid may also be helpful. 

Our strategy of using GAP to represent corrections to be added to DFT is 
powerful in terms of generality, but large gains in computational efficiency could 
be achieved with other strategies. The use of GAP to represent corrections 
to empirical force fields, particularly those that already account for molecular 
flexibility and for many-body effects (see e.g. Ref. [^ ITfl HHj), holds much 
promise for the future. However, we note that although such models already 
exist for water, as a result of many years of development effort, the same is true 
of very few other molecular systems. 

We have focused here on water, because of its outstanding importance for 
science and society, but the present methods should be readily applicable to 
many other molecular systems, even without the technical improvements we 
have mentioned. The cluster, solid and liquid states of simple molecules such as 
ammonia and hydrogen fluoride are obvious targets, but many kinds of mixture, 
including the environmentally and industrially important gas hydrates, are also 
accessible. 

5 Materials and Methods 

All calculations on molecules and isolated molecular assemblies were carried out 
using the Molpro package [5U|. The water dimer configurations were obtain from 
a long equilibrium simulation of liquid water using the AMOEBA force fieldjSl]. 
We used the Castep code [52j for the DFT calculations on ice structures, and 



8 



a modified version of the VASP code [53] for the molecular dynamics simulations 
of liquid heavy water (D2O). Here, the number of molecules per unit volume is 
the same as for H2O at 0.997 g cm~'^. Further technical details about basis-set 
completeness, time-step, temperature control etc are given in the Supplementary 
Information (SI), but we note that thermal averages were computed on a run 
of 45 ps duration of which 20 ps was discarded for equilibration. 



Software and data are available at http://www.libatoms.org 
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Figure 1: Top: Benchmark 2-body interaction (black) and errors of DFT with 
BLYP functional (blue) plotted against oxygen-oxygen distance i?oo- Bench- 
marks are computed with CCSD(T) close to the basis-set limit. Bottom: 2-body 
errors of BLYP (blue, same as in top panel) and errors of BLYP-f-GAP (red). 
The RMS deviation of BLYP+GAP from benchmarks is 0.45 meV. Also shown 
are errors of BLYP plus Grimme D3 dispersion correction The sample of 
500 dimer configurations shown here were drawn from a molecular dynamics 
simulation of liquid water and not used in the construction of the GAP models. 
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Figure 2: Relative energies (meV units) of four isomers of the water hexamer 
(atomic coordinates from Ref. [52]) computed with CCSD(T) close to the basis- 
set limit [7], diffusion Monte Carlo [32], BLYP and BLYP+GAP. Geometries 
are depicted below the Figure. 




Figure 3: Oxygen-oxygen radial distribution function of liquid water at 308 K 
at experimental density using BLYP (blue dashed) and BLYP+GAP (red solid) 
compared with two sets of experimental data (black dotted and dash-dotted). 
Experimental data are from joint refinement of neutron data and two sets of 
x-ray data, identified as HASYlab and PCCP (see Ref. 05] for details). 
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Table 2: Binding energies and volumes of ice polymorphs computing using DFT 
with the BLYP functional and BLYP+GAP compared with experimental val- 
ues |45| . Zero-point vibrational contributions have been removed from the ex- 
perimental energies [JHIj but not from the volumes. 
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24.6 
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23.8 


18.6 
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30.6 
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